Abstract. For a class of stochastic restart algorithms we address the effect
of a nonzero level of randomization in maximizing the convergence rate for
general energy landscapes. The resulting characterization of the optimal level
of randomization is investigated computationally for random as well as
parametric families of rugged energy landscapes.

1. Introduction

The question at the center of this short paper arose as a byproduct of the
author's doctoral dissertation. In [The95] the author studied a class of stochastic
restart algorithms for global optimization and developed upper and lower bounds
to their asymptotic convergence. These algorithms were then tested against
selected variants of the Simulated Annealing (SA) algorithm on a gamut of global
optimization problems.

The first step in the analysis performed in [The95] was the martingale
representation of the moment generating function of certain exit times of the Markov
process describing the algorithm. Subsequently, this representation was used to
establish asymptotic estimates of the Legendre transform of the moment generating
functions in question, which finally led to the large deviations bounds on the
convergence rate.

The present paper deals with the study of the representation of the moment
generating function. Aided by the specific representation, we investigate the
dependence of the asymptotic convergence rate on the level of global mixing. This
level of randomization is an explicit design parameter for the class of algorithms
we describe.

The fundamental message in this paper is that a nonzero level of randomness 
often improves performance robustness in an unknown rugged landscape. The qual- 
itative behavior of the convergence rate with varying levels of randomness is largely 
insensitive to detailed characteristics of the energy landscape. Thus, when faced 
with a global optimization problem for which we have limited knowledge of the 
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2 THEODORE V. THEODOSOPOULOS 

energy landscape, a critical amount of randomization by design is likely to maxi- 
mize the expected convergence rate while maintaining a consistently competitive 
performance over a wide range of varied energy landscapes. 

The first section of the paper reviews the problem setup and the main results 
from [The95]. An appendix includes all the relevant nomenclature from [The95]
which is used throughout the paper. The second section develops the criterion for 
the existence of a nonzero level of randomness which maximizes the convergence 
rate. This optimal level of randomness is represented as the solution to a pair of 
polynomial equations whose order is an increasing function of the relative depth 
and steepness of the global minimum well and the deepest strictly local minimum 
well. The next section exhibits the dependence of the convergence rate on the level 
of randomness for a set of random energy landscapes as well as three parametric 
families of energy landscapes. Finally, we briefly contrast the findings of our study 
to results regarding parallel implementations of simulated annealing by Azencott
et al. in [Aze92].

2. Class of Algorithms and Convergence Rate Estimates 

Let $f : X \to \mathbb{R}$ be a bounded, real-valued function on a discrete set $X$ (the
analysis in the paper applies irrespective of the finiteness of $X$, but we choose to
concentrate the discussion in this paper on the finite case, which offers an ample
set of applications). Let's assume that $X$ is equipped with a probability measure
$\mu \in \mathcal{M}_1(X)$ and a neighborhood structure $\{\mathcal{N}(x) \subset X,\ x \in X\}$. The problem is to
locate the set

    $\arg\min f(x) = \{ y \in X : f(y) \le f(x),\ \forall x \in X \}.$

Unless otherwise noted, we will assume from now on that $\min_{x \in X} f(x) = 0$. With
this in mind, our problem can be rephrased as searching for $f^{-1}(0)$ or its
$\epsilon$-approximation $\mathcal{L}(\epsilon)$.

The family of stochastic restart algorithms we study is denoted by $\mathcal{A}$ and
comprises Markov processes on $X$ with generators of the general form

(2.1)    $[\mathcal{G}\phi](x) \triangleq p\,1_A(x)\,\phi(\mathcal{D}f(x)) + (1 - p\,1_A(x))\,E^{\mu}[\phi] - \phi(x),$

for every $\phi : X \to \mathbb{R}$, where $p \in [0,1]$, $A \subset X$ and $\mathcal{D}f(x) = \arg\min_{y \in \mathcal{N}(x)} f(y)$. We
will say that the algorithm is generated by $\mathcal{G}$. From now on we will assume that
$\mathcal{D}f(\cdot)$ is a well-defined mapping of $X$ to itself; when

    $\mathrm{card}\left( \arg\min_{y \in \mathcal{N}(x)} f(y) \right) > 1$

we will implicitly assume that a deterministic choice is made.
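
The dynamics encoded by the generator (2.1) can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the implementation used in [The95]: the names `descent_step` and `hitting_time`, the representation of the landscape as a list `f` indexed by integer states, and the tie-breaking rule (smallest index wins) are all hypothetical choices made for the example. The neighborhood function is assumed to include the point itself, so that local minima are fixed points of the descent map.

```python
import random

def descent_step(x, f, neighbors):
    """Steepest-descent map: the lowest-energy point in the neighborhood of x.
    Ties are broken deterministically by the smallest index (an assumption)."""
    return min(neighbors(x), key=lambda y: (f[y], y))

def hitting_time(f, neighbors, mu, p, in_A, eps, x0, rng, max_steps=10**6):
    """Number of steps until the chain first reaches a state with f <= eps.
    With probability p (and only inside A) take a descent step;
    otherwise restart from a point drawn from the measure mu."""
    states = list(range(len(f)))
    x, k = x0, 0
    while f[x] > eps and k < max_steps:
        if in_A(x) and rng.random() < p:
            x = descent_step(x, f, neighbors)
        else:
            x = rng.choices(states, weights=mu)[0]
        k += 1
    return k
```

For instance, on the one-dimensional landscape `f = [3, 2, 1, 0, 1, 2]` with nearest-neighbor moves, `p = 1` and `A = X`, the chain started at state 0 simply descends to the minimum at state 3.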

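Let $\{X_k,\ k \ge 0\}$ be the stochastic process generated by $\mathcal{G}$. For any $\epsilon \in f(X)$,
let

    $\tau(\epsilon) \triangleq \inf\{k \ge 0 : X_k \in \mathcal{L}(\epsilon)\}.$

We are interested in evaluating the following performance measure for the class $\mathcal{A}$
of algorithms defined above:

    $C(\mathcal{G},\epsilon,x) \triangleq \lim_{N\to\infty} \frac{1}{N} \log P_x(\tau(\epsilon) > N),$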




SOME REMARKS ON THE OPTIMAL LEVEL OF RANDOMIZATION... 3 

where $P_x$ is the probability measure induced by the process when $X_0 = x$. We see
immediately that when $x \in X_2$, $p = 1$ and $N > d(x)$ we clearly have
$P_x(\tau(\epsilon) > N) = 0$, and so $C(\mathcal{G},\epsilon,x)$ is not defined in that case.

The main results in [The95] are summarized in the following theorem:

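Theorem 2.1. Fix $\epsilon \in f(X)$. Let

(2.2)    $Q(\xi,p) \triangleq 1 - e^{\xi}\sum_{j=0}^{a} p_2(j)\,p^{j}e^{j\xi} - (1-p)\,e^{\xi}\sum_{j=0}^{a \vee b}\bigl(q(j)+p_2(j)\bigr)\Bigl(\sum_{i=0}^{j-1}p^{i}e^{i\xi}\Bigr) - \frac{(1-p)\,e^{\xi}}{1-pe^{\xi}}\sum_{j=0}^{a}p_1(j),$

where we use the convention that $\sum_{i=0}^{c}(\cdot) = 0$ when $c < 0$. Then the following
statements hold:

1. When $X_3 = \emptyset$, $Q(\cdot,p)$ has a unique positive root for all $p \in [0,1]$. When
$X_3 \ne \emptyset$, $Q(\cdot,p)$ has a unique root in $(0, -\log p)$ for all $p \in [0,1)$. Let the
unique root defined above be denoted by $\xi_{crit}(\mathcal{G})$.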

2. Let f[y) ~ ye^"^ . The following set of equations 

has a unique solution in $(1,\infty)^2$. Let $(\alpha^*, \gamma^*)$ be this unique solution of
(2.3). Then

(2.4)    $-\alpha^*\gamma^*\,\xi_{crit} \le E^{\mu}\left[C(\mathcal{G},\epsilon,x)\right] \le -\xi_{crit}.$

A numerical evaluation of (2.3) leads to the approximation $\alpha^*\gamma^* \approx 8$.
The proof of Theorem 2.1 consists of four steps as presented in [The95]:

(i) We formulate a Dirichlet problem for $\mathcal{G}$ on $B(\epsilon)$ whose solution will provide
a martingale representation of the moment-generating function for the
distribution of $\tau(\epsilon)$, i.e. $E^{x}\left[e^{\xi\tau(\epsilon)}\right]$, where $E^{x}$ is shorthand for $E^{P_x}$.

(ii) We solve the above Dirichlet problem. The martingale representation
we obtain is valid for $\xi < \xi_{crit}$, where $\xi_{crit}$ is the unique root defined in
Theorem 2.1 above. Specifically, we have $E^{x}\left[e^{\xi\tau(\epsilon)}\right] = P(\xi,p)/Q(\xi,p)$,
where $P(\cdot,p)$ is another polynomial in $e^{\xi}$.

(iii) The idea is to use Cramér's large deviations theorem (as described on
pp. 22-31 of [Str93]) in order to estimate the tails of the distribution of
$\tau(\epsilon)$. As a first step in that direction we have to estimate the Legendre
transform $I_x(y)$ of $\log\left(E^{x}\left[e^{\xi\tau(\epsilon)}\right]\right)$. In [The95] we prove that

    $I_x(y) = y\,\Xi_x(y) - \log\left(E^{x}\left[e^{\Xi_x(y)\tau(\epsilon)}\right]\right),$

where

    $\lim_{y\to\infty} \Xi_x(y) = \xi_{crit}, \quad \forall x \in X.$

Actually, the following rate estimate holds: 

Lemma 2.2. For every $x \in X$, there exists a positive constant $c(x)$
such that

    $\Xi_x(y) = \xi_{crit} - \frac{c(x)}{y} + o\left(\frac{1}{y}\right).$

Using the upper bound in Cramér's theorem we obtain the easy
direction of Theorem 2.1, namely, for each $x \in X$,

    $C(\mathcal{G},\epsilon,x) \le -\xi_{crit}.$

(iv) In order to obtain the lower bound in Theorem 2.1 we need to strengthen
Cramér's lower bound. In particular, instead of using a variance
estimate in conjunction with Chebyshev's inequality, we apply Cramér's
upper bound to the appropriately normalized random variable. Even
though Cramér's upper bound uses Chebyshev's inequality, it proves to be
stronger because of the maximization involved. This difference is sufficient
to provide us with the desired result.
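
As a quick sanity check on the upper bound, consider the fully randomized case $p = 0$ with $A = X$, where every step is an independent draw from $\mu$. The short computation below is an illustration added for the reader and is not part of the proof in [The95]. It relies only on the identity $\xi_{crit}(0) = -\log(1 - q(0))$ used in Section 3 and on the observation that, when $A = X$, the points of the well of $\mathcal{L}(\epsilon)$ with $d(x) = 0$ are exactly the points of $\mathcal{L}(\epsilon)$, so that $q(0) = \mu(\mathcal{L}(\epsilon))$.

```latex
\begin{align*}
P_x(\tau(\epsilon) > N) &= \prod_{k=1}^{N} \mu\bigl(B(\epsilon)\bigr) = (1 - q(0))^{N},
    \qquad x \in B(\epsilon), \\
C(\mathcal{G}, \epsilon, x) &= \lim_{N \to \infty} \frac{1}{N} \log (1 - q(0))^{N}
    = \log(1 - q(0)) = -\xi_{crit}(0).
\end{align*}
```

In this degenerate case the hitting time is geometric and the upper bound $C(\mathcal{G},\epsilon,x) \le -\xi_{crit}$ holds with equality.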

3. Dependence of $\xi_{crit}(\mathcal{G})$ on $p$

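The equation defining the critical exponent $\xi_{crit}(\mathcal{G})$ can be rewritten as:

(3.1)    $\sum_{j=0}^{(a \vee b)+2} e^{j\xi_{crit}} \sum_{i=0}^{j} c_{ji}\,p^{i} = 0,$

where $c_{ji}$ are constants which depend only on $p_1(\cdot)$, $p_2(\cdot)$ and $q(\cdot)$. Differentiating
(3.1) with respect to $p$, and evaluating at a stationary point $p^*$ of $\xi_{crit}$, we obtain

(3.2)    $\sum_{j=0}^{(a \vee b)+2} e^{j\xi_{crit}} \sum_{i=0}^{j} c_{ji}\,i\,(p^*)^{i-1} = 0.$

Changing the order of summation makes (3.2) into a polynomial in $p^*$:

(3.3)    $\sum_{i=0}^{(a \vee b)+2} i\,(p^*)^{i-1} \sum_{j=i}^{(a \vee b)+2} c_{ji}\,e^{j\xi_{crit}} = 0.$

Thus we can solve (3.1) and (3.3) simultaneously for $\xi_{crit}(\mathcal{G})$ and $p^*$.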

Let $\mathcal{G}_1$ denote the algorithm obtained by (2.1) when $p = 1$ and $A = X \setminus \mathcal{F}(\mathcal{D}f)$.
Using (2.2) we see that in this case (3.1) becomes (since $\xi_{crit} > 0$)

(3.4)    $Q_1(\xi) \triangleq 1 - e^{\xi}\sum_{j=0}^{a} p_2(j)\,e^{j\xi} = 0.$

This case corresponds to the minimum randomization that still guarantees
asymptotic convergence to $\mathcal{L}(\epsilon)$.

On the other hand let $\mathcal{A}_2$ denote the subclass of algorithms obtained by (2.1)
when $A = X$. This condition implies that $p_2(\cdot) \equiv 0$. Using (2.2) in this case we see
that (3.1) simplifies to

(3.5)    $Q_2(\xi,p) \triangleq \frac{1 - e^{\xi} + (1-p)\sum_{j=0}^{b} q(j)\,p^{j}e^{(j+1)\xi}}{1 - pe^{\xi}} = 0.$

The algorithms in $\mathcal{A}_2$ choose between a steepest descent step and a global jump
with fixed probability $p$ irrespective of current location. In order to guarantee
asymptotic convergence to $\mathcal{L}(\epsilon)$ we must restrict $\mathcal{A}_2$ to have $p < 1$.
We will show when $\mathcal{A}_2$ is preferred to $\mathcal{G}_1$ and for which $p$.
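
To make the dependence of the critical exponent on $p$ concrete for the class with $A = X$, the following Python sketch locates the root numerically and scans a grid of $p$ values. It is an illustration under stated assumptions rather than the procedure used in [The95]: it assumes that the numerator of (3.5) is $1 - e^{\xi} + (1-p)\sum_j q(j)\,p^j e^{(j+1)\xi}$ (the form consistent with the derivatives (3.7) and (3.10) in the proof of Lemma 3.1), that the root is unique in $(0, -\log p)$ as stated in Theorem 2.1, and that $q(0) > 0$ and $\sum_j q(j) < 1$. The function names and the choice of bisection and grid search are hypothetical.

```python
import math

def q2_numerator(xi, p, q):
    """Assumed numerator of (3.5): 1 - e^xi + (1-p) * sum_j q[j] p^j e^{(j+1) xi}."""
    z = math.exp(xi)
    return 1.0 - z + (1.0 - p) * sum(qj * (p * z) ** j * z for j, qj in enumerate(q))

def xi_crit(p, q, tol=1e-12):
    """Root of the numerator in (0, -log p), found by bisection.
    The numerator is positive at xi = 0 and negative at the right end."""
    if not (0.0 <= p < 1.0) or q[0] <= 0.0 or sum(q) >= 1.0:
        raise ValueError("need 0 <= p < 1, q[0] > 0 and sum(q) < 1")
    lo = 0.0
    # For p = 0 the root is -log(1 - q[0]); bracket it from above.
    hi = -math.log(p) if p > 0.0 else 1.0 - math.log(1.0 - q[0])
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q2_numerator(mid, p, q) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def best_p(q, grid=1000):
    """Grid search for the p in [0, 1) that maximizes xi_crit."""
    candidates = [i / grid for i in range(grid)]
    return max(candidates, key=lambda p: xi_crit(p, q))
```

For example, with the hypothetical masses `q = [0.01, 0.05, 0.04]` the grid search returns a value of `p` strictly between 0 and 1, and the exponent at that value exceeds the exponent at `p = 0`.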




Lemma 3.1. Let

    $p^* \triangleq \arg\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}).$

If $q(1) - q(0)\,(1 - q(0)) > 0$, then $p^* \in (0,1)$ and it solves equation (3.3) with $c_{ji}$
corresponding to $A = X$.

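Proof. Let $\tilde{Q}_2 \triangleq (1 - pe^{\xi})\,Q_2$. Differentiating $\tilde{Q}_2(\xi_{crit}(p),p) = 0$ with respect to
$p$ we obtain

(3.6)    $\frac{d\xi_{crit}}{dp} = -\frac{\frac{\partial\tilde{Q}_2}{\partial p}(\xi_{crit}(p),p)}{\frac{\partial\tilde{Q}_2}{\partial\xi}(\xi_{crit}(p),p)}.$

Using (3.5) we see that

(3.7)    $\frac{\partial\tilde{Q}_2}{\partial p}(\xi,p) = \sum_{j=0}^{b} q(j)\,p^{j-1}e^{(j+1)\xi}\,[\,j - (1+j)p\,],$

which implies that

    $\frac{\partial\tilde{Q}_2}{\partial p}(\xi,0) = e^{\xi}\left(e^{\xi}q(1) - q(0)\right).$

Evaluating the above equation at $\xi = \xi_{crit}(0)$ and noticing that

    $\xi_{crit}(0) = -\log(1 - q(0)),$

we conclude that

(3.8)    $\frac{\partial\tilde{Q}_2}{\partial p}(\xi_{crit}(0),0) = \frac{q(1) - q(0)\,(1 - q(0))}{(1 - q(0))^{2}} > 0,$

where we have used the assumption in the statement of the lemma. Furthermore,
(3.7) implies that

(3.9)    $\lim_{p\to 1}\frac{\partial\tilde{Q}_2}{\partial p}(\xi,p) = -\sum_{j=0}^{b} q(j)\,e^{(j+1)\xi} < 0.$

At the same time, differentiating (3.5) with respect to $\xi$ we obtain

    $\frac{\partial\tilde{Q}_2}{\partial\xi}(\xi,p) = -e^{\xi} + (1-p)\sum_{j=0}^{b} q(j)\,(j+1)\,p^{j}e^{(j+1)\xi},$

which implies that

(3.10)    $\frac{\partial\tilde{Q}_2}{\partial\xi}(\xi,0) = -e^{\xi}\,(1 - q(0)) < 0$

and

(3.11)    $\lim_{p\to 1}\frac{\partial\tilde{Q}_2}{\partial\xi}(\xi,p) = -e^{\xi} < 0.$

Using (3.8), (3.9), (3.10) and (3.11) we see that (3.6) implies

(3.12)    $\frac{d\xi_{crit}}{dp}(0) > 0$ and $\lim_{p\to 1}\frac{d\xi_{crit}}{dp}(p) < 0.$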

Lemma 3 on p. 25 of [The95] tells us more generally that

Lemma 3.2. $\frac{\partial Q}{\partial\xi} < 0$ for all $\xi \in [0, -\log p)$ and $p \in [0,1)$.
From (3.6) we observe that $\frac{d\xi_{crit}}{dp} < \infty \iff \frac{\partial\tilde{Q}_2}{\partial\xi} \ne 0$. As a consequence of
Lemma 3.2 we conclude that $\xi_{crit}(p)$ is differentiable for all $p \in [0,1)$ and therefore
continuous. Then, (3.12) leads to the conclusion that there exists a $p^* \in (0,1)$
which solves the equation $\frac{d\xi_{crit}}{dp} = 0$. In particular we can conclude that there is a
$p^* \in (0,1)$ which maximizes $\xi_{crit}$. Specializing (3.1) and (3.3) to this case we can
represent

    $\left(\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}),\ p^*\right)$

as a solution of the following set of equations:

    $1 - e^{\xi} + (1-p)\sum_{j=0}^{b} q(j)\,p^{j}e^{(j+1)\xi} = 0,$
    $\sum_{j=0}^{b} q(j)\,p^{j-1}e^{(j+1)\xi}\,[\,j - (1+j)p\,] = 0.$


The following theorem contains the main result of this paper:

Theorem 3.3. For any energy landscape which satisfies

    $\sum_{j=0}^{a} \frac{p_2(j)}{(1 - q(0))^{j+1}} > 1,$

there exists a nonzero level of randomness which optimizes the asymptotic
convergence rate of algorithms in class $\mathcal{A}_2 \cup \{\mathcal{G}_1\}$, that is

    $\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}) > \xi_{crit}(\mathcal{G}_1).$

Proof. Let's consider the joint solvability of (3.4) and (3.5). Specifically, we
obtain $\xi_{crit}(\mathcal{G}_1)$ by solving (3.4) and then we solve

(3.13)    $Q_2(\xi_{crit}(\mathcal{G}_1), \cdot) = 0.$

Let's assume that $\bar{p} \in (0,1)$ exists which solves (3.13). In this case, let

(3.14)    $\hat{p} = \begin{cases} \bar{p} + \delta & \text{if } \frac{\partial Q_2}{\partial p}(\xi_{crit}(\mathcal{G}_1),\bar{p}) > 0, \\ \bar{p} - \delta & \text{if } \frac{\partial Q_2}{\partial p}(\xi_{crit}(\mathcal{G}_1),\bar{p}) < 0, \\ \bar{p} & \text{otherwise,} \end{cases}$

for some sufficiently small $\delta > 0$. Then, by construction, $Q_2(\xi_{crit}(\mathcal{G}_1),\hat{p}) \ge 0$.
Theorem 2.1 guarantees the existence of $\xi_{crit}(\hat{p})$ which solves $Q_2(\cdot,\hat{p}) = 0$. Lemma
3.2 then implies that

    $\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}) \ge \xi_{crit}(\hat{p}) \ge \xi_{crit}(\mathcal{G}_1).$

Conversely, let's assume that there exists a $p \in (0,1)$ such that $\xi_{crit}(p) >
\xi_{crit}(\mathcal{G}_1)$. Then, Lemma 3.2 implies that $Q_2(\xi_{crit}(\mathcal{G}_1),p) > 0$. The question then
becomes one of existence of a solution to

(3.15)    $Q_2(\xi, \cdot) = 0$

in $[0, e^{-\xi})$. From (3.5) we see that $Q_2(\xi,0) = 1 - e^{\xi}\,(1 - q(0))$ and

    $\lim_{p \to e^{-\xi}} Q_2(\xi,p) = -\infty.$

So, when $\xi < -\log(1 - q(0))$, there is always a solution to (3.15).
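There are three corner cases. First, if $\xi_{crit}(\mathcal{G}_1) = -\log(1 - q(0))$, then, as we
saw in the proof of Lemma 3.1, $\xi_{crit}(\mathcal{G}_1) = \xi_{crit}(0) < \sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G})$.

The second corner case is when $\bar{p} = 1$. This means that $\lim_{p\to 1} \xi_{crit}(p) =
\xi_{crit}(\mathcal{G}_1)$. But Lemma 3.1 tells us that there exists a $p^* < 1$ such that $\xi_{crit}(p^*) >
\lim_{p\to 1} \xi_{crit}(p)$ and thus $\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}) > \xi_{crit}(\mathcal{G}_1)$.

Finally, the case $\frac{\partial Q_2}{\partial p}(\xi_{crit}(\mathcal{G}_1),\bar{p}) = 0$ implies that $\xi_{crit}(\mathcal{G}_1) = \xi_{crit}(\bar{p}) <
\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G})$.

Therefore we have proved that

(3.16)    $\xi_{crit}(\mathcal{G}_1) < -\log(1 - q(0))$

is a sufficient condition for $\sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}) > \xi_{crit}(\mathcal{G}_1)$. Plugging (3.16) into (3.4)
we see that

    $\sum_{j=0}^{a} \frac{p_2(j)}{(1 - q(0))^{j+1}} > 1 \implies \sup_{\mathcal{G} \in \mathcal{A}_2} \xi_{crit}(\mathcal{G}) > \xi_{crit}(\mathcal{G}_1),$

which completes the proof of the theorem. □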

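Let $p_2^{min} \triangleq \min\{p_2(j) \mid j \in [0,a]\}$. Notice that $p_2^{min} > 0$, and therefore

    $\sum_{j=0}^{a} \frac{p_2(j)}{(1 - q(0))^{j+1}} \ge p_2^{min} \sum_{j=0}^{a} \left(\frac{1}{1 - q(0)}\right)^{j+1} = \frac{p_2^{min}}{q(0)}\left[(1 - q(0))^{-(a+1)} - 1\right].$
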

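Also, observe that the general case ($\mathcal{G} \in \mathcal{A}$) is always preferable to $\mathcal{A}_2$ and that

\im(A\A2)^l\^(A\{Gi}) =

Finally, let

    $p_{best} \triangleq \arg\sup_{p \in [0,1],\ \mathcal{G} \in \mathcal{A}_2 \cup \{\mathcal{G}_1\}} \xi_{crit}(\mathcal{G}).$
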
The above discussion leads to the following corollary. 

Corollary 3.4. Any energy landscape with a sufficiently deep strictly local
minimum has a nonzero level of randomness which optimizes the asymptotic
convergence rate of any algorithm in $\mathcal{A}$. Specifically, for any fixed $q(0)$ and $p_2^{min}$,

    $a + 1 > \frac{\log\left(1 + \frac{q(0)}{p_2^{min}}\right)}{-\log(1 - q(0))}$

is sufficient to guarantee that $p_{best} \in (0,1)$. This further implies that

    $\arg\sup_{p \in [0,1],\ \mathcal{G} \in \mathcal{A}} \xi_{crit}(\mathcal{G}) \in (0,1).$
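
The sufficient condition of Theorem 3.3 and the depth bound behind Corollary 3.4 are easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, the condition is taken to be $\sum_{j=0}^{a} p_2(j)/(1-q(0))^{j+1} > 1$ as obtained at the end of the proof of Theorem 3.3, and the depth bound is read as a lower bound on $a + 1$, which is what bounding each $p_2(j)$ below by $p_2^{min}$ and summing the geometric series gives.

```python
import math

def theorem_condition(p2, q0):
    """True when sum_j p2[j] / (1 - q0)^(j + 1) exceeds 1."""
    return sum(pj / (1.0 - q0) ** (j + 1) for j, pj in enumerate(p2)) > 1.0

def depth_bound(q0, p2_min):
    """Threshold log(1 + q0 / p2_min) / (-log(1 - q0)) on the well depth."""
    return math.log(1.0 + q0 / p2_min) / (-math.log(1.0 - q0))
```

With the hypothetical values `q0 = 0.1` and a flat `p2(j) = 0.02`, the threshold is roughly 17.0, so a landscape with 18 levels (`a = 17`) satisfies the condition while one with 17 levels just misses it.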

In practice, it is useful to note that the measure of depth used in Corollary
3.4 is a function of the discretization level. Specifically, given any energy function
$f$, increasing the desired level of accuracy for the determination of $\mathcal{L}(\epsilon)$ leads to
an effective increase in the energy landscape parameters $a$ and $b$. Thus, Corollary
3.4 assures us that a nonzero optimal level of randomization is ubiquitous in global
optimization.

4. Dependence of $p^*$ on the Energy Landscape

So far we have exhibited the dependence of the convergence rates in A on 
the level of randomization by design in the algorithm. We have seen that under 
a rather general condition on the energy landscape, there is a nonzero level of 
imposed randomization which maximizes the convergence rate, irrespective of the 
detailed structure of the energy landscape. In this section we focus on the depen- 
dence of the optimal level of randomization on global characteristics of the energy 
landscape. Specifically we are interested in the question of optimizing the design 
of the algorithm for particular energy landscapes. 

The outcome of this investigation is a surprising lack of sensitivity in the
qualitative characteristics of the way in which the optimal level of randomization varies
across wide ranges of energy landscapes. Specifically we study four families of
energy landscapes: random, exponential, polynomial, and logarithmic, where the
latter three refer to the steepness of the energy wells in the landscape. Random
landscapes were constructed by identifying $q(\cdot)$, $p_1(\cdot)$ and $p_2(\cdot)$ with appropriately
normalized uniform random variables. On the other hand, the exponential,
polynomial and logarithmic landscapes were constructed by assigning (after appropriate
normalization) $q(j), p_1(j), p_2(j) \sim \beta^{-j}$ for $\beta > 1$, $q(j), p_1(j), p_2(j) \sim j^{-\alpha}$ for $\alpha > 0$
and $q(j), p_1(j), p_2(j) \sim (\log j)^{-\gamma}$ for $\gamma > 0$ respectively. The parameters $\beta$, $\alpha$ and
$\gamma$ quantify the steepness of the basins of attraction in the respective parametric
family. Specifically, in all three cases, the basins of attraction become steeper as
the relevant parameter decreases in value.

Another parameter that controls the geometry of the energy landscape is

    $c \triangleq \frac{\mu(\Gamma(\epsilon))}{\mu(W(\mathcal{L}(\epsilon)))}.$

All computational experiments reported below have been performed using the
values $a = 20$, $b = 10$ and $c = 1000$.
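
A random landscape of the kind described above can be generated as follows. This is a sketch under explicit assumptions, since the paper only says that the masses are "appropriately normalized" uniform random variables: here the well masses `q` are scaled to sum to $1/(1+c)$ and the outside masses `p1`, `p2` to sum jointly to $c/(1+c)$, so that the ratio of the mass outside the well to the mass inside it equals $c$ and the total mass is one. The function name and signature are hypothetical.

```python
import random

def random_landscape(a=20, b=10, c=1000.0, rng=None):
    """Draw uniform random masses q(0..b), p1(0..a), p2(0..a) and normalize them
    so that (sum p1 + sum p2) / sum q == c and all masses sum to 1 (assumed scheme)."""
    rng = rng or random.Random()
    q_raw = [rng.random() for _ in range(b + 1)]
    p1_raw = [rng.random() for _ in range(a + 1)]
    p2_raw = [rng.random() for _ in range(a + 1)]
    well_mass = 1.0 / (1.0 + c)      # total mass inside the well of the target set
    outside_mass = c / (1.0 + c)     # total mass outside that well
    q_total = sum(q_raw)
    outside_total = sum(p1_raw) + sum(p2_raw)
    q = [well_mass * v / q_total for v in q_raw]
    p1 = [outside_mass * v / outside_total for v in p1_raw]
    p2 = [outside_mass * v / outside_total for v in p2_raw]
    return q, p1, p2
```

Passing a seeded `random.Random` instance makes the draw reproducible.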

Figure 1 illustrates the optimal level of randomization for a randomly generated
energy landscape. We see that the optimal convergence rate can be significantly
faster than the convergence rate corresponding to the minimum amount of
randomization or to a randomly chosen level of randomization. We can also observe that
there is a range of $p$ around $p^*$ for which $\mathcal{A}_2$ is preferable to $\mathcal{G}_1$. Outside this range,
$\mathcal{G}_1$ is preferable to any member of $\mathcal{A}_2$.

Figure 2 illustrates the observed tradeoff when facing an unknown energy
landscape. Specifically, what is required is a high expected convergence rate and at
the same time a low variance for the convergence rate. This is the performance
robustness problem. A Monte Carlo simulation was performed with 100
independent randomly generated landscapes. In Figure 2 we have suppressed the third
dimension which represents the variation of $p$. As $p$ increases from 0 to 1, we move
along the curve in Figure 2 from the top right-hand corner to the bottom left-hand
corner. Two regimes appear prominent:

▷ a liquid regime in which the variance of the convergence rate decreases
steadily with decreasing levels of randomization while the expected
convergence rate remains relatively constant, and

▷ a solid regime in which the expected convergence rate decreases rapidly
over a very short range of randomization levels while the variance of the
convergence rate remains largely unchanged.

Figure 1. Convergence rate as a function of $p$ for a randomly generated energy landscape

Figure 2. Phase Transition for the Algorithm Class $\mathcal{A}_2$

Figure 3. Dependence of $p_{best}$ on $\beta$ for exponential energy landscapes

We refer to this empirically observed phenomenon as a Phase Transition. There
appears to exist a deeper relationship between the values of $p$ that are poised between
the two regimes and $p^*$. At this juncture there is limited computational evidence
in support of this conjecture. This conjecture appears to be related to the
identification of the edge of chaos in [Kau95] as well as to the critical level of parallelism
investigated in [MSK96]. Our approach of examining the performance
robustness tradeoff in a population of randomly generated energy landscapes is related
to the efficiency frontiers described in [HLH97] and the discussion of ensembles of
landscapes in [Dit96].
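
The robustness experiment described above can be imitated on a small scale for the class with $A = X$. The sketch below is not the code behind Figure 2 and makes several assumptions that should be read as such: landscapes are reduced to their well masses `q` (drawn uniformly and normalized to total mass $1/(1+c)$), the critical exponent is taken to be the root in $(0, -\log p)$ of $1 - e^{\xi} + (1-p)\sum_j q(j)\,p^j e^{(j+1)\xi}$, and the mean and population variance of that exponent across landscapes are used as the two robustness coordinates. All function names are hypothetical.

```python
import math
import random
import statistics

def xi_crit(p, q, tol=1e-12):
    """Bisection for the root of 1 - e^xi + (1-p) sum_j q[j] p^j e^{(j+1) xi}."""
    def g(xi):
        z = math.exp(xi)
        return 1.0 - z + (1.0 - p) * sum(qj * (p * z) ** j * z for j, qj in enumerate(q))
    lo = 0.0
    hi = -math.log(p) if p > 0.0 else 1.0 - math.log(1.0 - q[0])
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def robustness_curve(ps, n_landscapes=100, b=10, c=1000.0, seed=0):
    """For each p, return (p, mean, variance) of xi_crit over random landscapes."""
    rng = random.Random(seed)
    landscapes = []
    for _ in range(n_landscapes):
        raw = [rng.uniform(0.01, 1.0) for _ in range(b + 1)]  # keeps q[0] > 0
        scale = 1.0 / ((1.0 + c) * sum(raw))
        landscapes.append([r * scale for r in raw])
    curve = []
    for p in ps:
        rates = [xi_crit(p, q) for q in landscapes]
        curve.append((p, statistics.mean(rates), statistics.pvariance(rates)))
    return curve
```

Plotting the mean against the variance as `p` sweeps over a fine grid in $[0, 1)$ gives a curve that can be compared qualitatively with the two regimes described above; no particular shape is claimed here.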

Figures 3, 4 and 5 show the way in which $p_{best}$ varies with the parameters of
exponential, polynomial and logarithmic energy landscapes respectively. We see
that the qualitative characteristics in all three cases are indistinguishable.

Similarly, Figures 6, 7 and 8 illustrate the way in which $\xi_{crit}(p_{best})$ varies
with the parameters of exponential, polynomial and logarithmic energy landscapes
respectively. Once more, the qualitative behavior of the optimal convergence rate
is indistinguishable between the three cases.

5. Conclusions 



Using the methodology developed in [The95] we have studied the desirability
of randomization by design to improve the convergence rate of global optimization
algorithms. Theorem 3.3 describes a sufficient condition for the usefulness of such
imposed randomness. This condition has been shown to hold generically for energy
landscapes with sufficiently deep strictly local minima. We have also shown how to
represent the optimal level of randomization as the solution to a pair of polynomial
equations whose orders are related to the depths of the basins of attraction in the
energy landscape in question.

Figure 4. Dependence of $p_{best}$ on $\alpha$ for polynomial energy landscapes

Figure 5. Dependence of $p_{best}$ on $\gamma$ for logarithmic energy landscapes

Figure 6. Dependence of $\xi_{crit}(p_{best})$ on $\beta$ for exponential energy landscapes

Figure 7. Dependence of $\xi_{crit}(p_{best})$ on $\alpha$ for polynomial energy landscapes

Figure 8. Dependence of $\xi_{crit}(p_{best})$ on $\gamma$ for logarithmic energy landscapes



The study of randomly generated energy landscapes has led to the
characterization of a Phase Transition associated with the performance robustness problem.
Specifically, there is a narrow range of randomization levels which combine
competitive expected convergence rates with minimal variance of that convergence rate. If
we increase the level of randomization, we fall into a liquid phase which increases
the variance of the convergence rate. If on the other hand we decrease the level of
randomization, we fall into a solid state which entails a rapid deterioration of the
expected convergence rate in return for a modest further reduction in variance.

The investigation of the three parametric families of energy landscapes leads 
us to the following conclusions: 

▷ A nonzero level of randomization by design is desirable in all cases.

▷ In all three cases, the optimal level of randomization is a monotonically
increasing, convex function of the steepness of the basins of attraction (as
captured by the parameters $\beta$, $\alpha$ and $\gamma$ in the three families respectively).

▷ Similarly, in all three cases, the optimal convergence rate is a
monotonically increasing, convex function of the steepness of the basins of
attraction.

▷ The geometric characteristics of the optimal level of randomization as well
as the resulting optimal convergence rate are largely insensitive to drastic
variations in the geometry of the energy landscape.

This empirically observed robustness in the performance of appropriately random- 
ized gradient descent algorithms is a desirable property for systems facing complex, 
largely unknown nonconvex energy landscapes. 

It is worthwhile to provide a comparison of our conclusions to results from
parallelization attempts for Simulated Annealing (SA). Specifically, if $P_x$ is
temporarily used to denote the measure in path space induced by sequential SA, then
we know from [Cat92] that:

lim irif sup \ogPx{f{XN) > e) = -7^, 



where 



and 



Df = max I ^^ : xeT (Vf) n B{0) 



    $H(x) = \max\{f(z) - f(y) : y \in W(x),\ z \in \partial W(x)\}.$

In [Aze92], a variety of parallelization schemes are proposed for SA, all based on
interacting multiple versions of the traditional, sequential SA. The convergence rate
thus obtained becomes exponential with

lim inf sup — \ogV^{f{X]^) > e) 



N^^l3i-)y^^x N ^ ^-"^ 2eDfK 

where $K$ is the constant involved in the sequential SA convergence rate and $P_x$ now
refers to the path measure induced by the parallelized version of SA (see [Aze92]).
Conceptually, the nonzero level of imposed randomness in the restart gradient
descent algorithms discussed in this paper corresponds to non-monotonic annealing
schedules in the context of SA. Furthermore, the Bernoulli restarts proposed here
offer a generalization of the setup in [Aze92]. Finally, the optimal expected time
between restarts is found to be independent of the overall time allowed, a property
which is consistent with a constant Bernoulli success probability.

To recapitulate, the main findings of this research address the characterization
of the desirability of a nonzero level of randomization. The optimized algorithm
design is qualitatively invariant over a wide range of diverse energy landscapes.
More work is required to develop a concrete understanding of the relationship
between the optimal level of imposed randomization and the range of $p$ which strikes
a balance between a competitive expected convergence rate and low variability for
the convergence rate.

6. Appendix 

This Appendix includes the notation used throughout the paper. 

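Definition 6.1. Fix $\epsilon \in [0, \max_{x \in X} f(x))$. Let

• Events up to time $n$

• Points at and below energy level $\epsilon$

    $\mathcal{L}(\epsilon) \triangleq f^{-1}([0,\epsilon]),$

• Points above energy level $\epsilon$

    $B(\epsilon) \triangleq X \setminus \mathcal{L}(\epsilon),$

• Energy "well" (zone of attraction) of set $A$

    $W(A) \triangleq \left\{x \in X : \lim_{k\to\infty} [\mathcal{D}^k f](x) \in A\right\},$

• Points outside the "well" of $\mathcal{L}(\epsilon)$

    $\Gamma(\epsilon) \triangleq X \setminus W(\mathcal{L}(\epsilon)),$

• Set of local minima of $f$

    $\mathcal{F}(\mathcal{D}f) \triangleq \{x \in X : \mathcal{D}f(x) = x\},$

• Number of gradient descent steps needed to go from $x$ to its closest local
minimum or $X \setminus A$ (whichever is less)

    $d(x) \triangleq \min\left\{k \ge 0 : [\mathcal{D}^k f](x) \in (\mathcal{F}(\mathcal{D}f) \cup \mathcal{L}(\epsilon)) \cup (X \setminus A)\right\},$

• Mass assigned by $\mu$ to points in $d^{-1}(j)$ and inside $W(\mathcal{L}(\epsilon))$

    $q(j) \triangleq \mu\left(W(\mathcal{L}(\epsilon)) \cap d^{-1}(j)\right),$

• Mass assigned by $\mu$ to points $x$ in $d^{-1}(j)$, outside $W(\mathcal{L}(\epsilon))$ and when
gradient descent from $x$ leads to a local minimum in $A$

    $p_1(j) \triangleq \mu\left(\Gamma(\epsilon) \cap d^{-1}(j) \cap [\mathcal{D}^j f]^{-1}(A)\right),$

• Mass assigned by $\mu$ to points $x$ in $d^{-1}(j)$, outside $W(\mathcal{L}(\epsilon))$ and when
gradient descent from $x$ leads to $X \setminus A$

    $p_2(j) \triangleq \mu\left(\Gamma(\epsilon) \cap d^{-1}(j) \cap [\mathcal{D}^j f]^{-1}(X \setminus A)\right),$

• Maximum $d(x)$ outside $W(\mathcal{L}(\epsilon))$

    $a \triangleq \max\{j \ge 0 : p_1(j) \vee p_2(j) > 0\},$

• Maximum $d(x)$ inside $W(\mathcal{L}(\epsilon))$

    $b \triangleq \max\{j \ge 0 : q(j) > 0\}.$

From the above definitions, one notices a natural decomposition of $X$ which we
will use extensively:

    $X = X_1 \cup X_2 \cup X_3,$

where

    $X_1 \triangleq \left\{x \in X : [\mathcal{D}^{d(x)} f](x) \in X \setminus A\right\},$
    $X_2 \triangleq W(\mathcal{L}(\epsilon)),$
    $X_3 \triangleq \left\{x \in X : [\mathcal{D}^{d(x)} f](x) \in (A \cap B(\epsilon))\right\}.$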

Some clarification is due regarding the boundary and the closure of a discrete set. 
We use the following definitions: 

Definition 6.2. For any $A \subset f(X)$,

    $\partial f^{-1}(A) \triangleq \left(X \setminus f^{-1}(A)\right) \cap \bigcup_{x \in f^{-1}(A)} \mathcal{N}(x),$

    $\overline{f^{-1}(A)} \triangleq f^{-1}(A) \cup \partial f^{-1}(A).$